62 research outputs found

    Protein sequence classification using feature hashing

    Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions; hence, dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification: the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space using a hash function, i.e., by mapping features to hash keys, where multiple features can be mapped (at random) to the same hash key and their counts aggregated. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
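    A minimal sketch of the idea described above, not the paper's exact implementation: k-gram counts are folded into a fixed-size vector by hashing each k-gram to one of `dim` buckets and aggregating the counts of colliding k-grams. The function name and parameters here are illustrative only.

```python
import hashlib

def hashed_kgram_vector(sequence, k=3, dim=1024):
    """Map a protein sequence to a dim-dimensional hashed k-gram count vector."""
    vec = [0] * dim
    for i in range(len(sequence) - k + 1):
        kgram = sequence[i:i + k]
        # Stable hash of the k-gram; distinct k-grams may collide in the
        # same bucket, in which case their counts are aggregated.
        bucket = int(hashlib.md5(kgram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1
    return vec

# Toy protein fragment: 20 overlapping 3-grams hashed into 1024 buckets.
print(sum(hashed_kgram_vector("MKTAYIAKQRQISFVKSHFSRQ", k=3, dim=1024)))
```

    Some feature-hashing variants additionally use a second hash to assign each feature a ±1 sign, which reduces collision bias; that refinement is omitted here for brevity.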

    Notes on the taxonomy, geography, and ecology of the piliferous Campylopus species in the Netherlands and N.W. Germany

    Copyright information: taken from "Glycosylation site prediction using ensembles of Support Vector Machine classifiers", BMC Bioinformatics 2007;8:438 (http://www.biomedcentral.com/1471-2105/8/438), published online 9 Nov 2007, PMCID: PMC2220009. […] of the data extracted from the original glycoprotein sequence dataset for C-linked glycosylation using local sequence identity with the 0/1 String Kernel.

    Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models

    Background: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data.

    Results: In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) an expectation maximization (EM) based approach; and (iii) a co-training based approach to semi-supervised training of MMs, both of which make use of unlabeled data.

    Conclusions: The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.
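    A minimal sketch of the supervised Markov model (MM) baseline that the paper compares against; the class name, smoothing scheme, and details are assumptions of this illustration, not taken from the paper. One first-order Markov chain over amino acids is fit per localization class, and a sequence is assigned to the class with the highest log-likelihood. The AAMM additionally learns abstractions over the alphabet from unlabeled data, which this sketch omits.

```python
import math
from collections import defaultdict

class MarkovClassifier:
    """Per-class first-order Markov chains with additive smoothing."""

    def __init__(self, alphabet="ACDEFGHIKLMNPQRSTVWY", smoothing=1.0):
        self.alphabet = alphabet
        self.smoothing = smoothing
        self.counts = {}  # class -> {(prev, cur): count}

    def fit(self, sequences, labels):
        for seq, y in zip(sequences, labels):
            trans = self.counts.setdefault(y, defaultdict(float))
            for prev, cur in zip(seq, seq[1:]):
                trans[(prev, cur)] += 1.0
        return self

    def _log_likelihood(self, seq, trans):
        ll = 0.0
        for prev, cur in zip(seq, seq[1:]):
            num = trans[(prev, cur)] + self.smoothing
            den = (sum(trans[(prev, a)] for a in self.alphabet)
                   + self.smoothing * len(self.alphabet))
            ll += math.log(num / den)
        return ll

    def predict(self, seq):
        # Pick the class whose chain assigns the sequence the highest likelihood.
        return max(self.counts, key=lambda y: self._log_likelihood(seq, self.counts[y]))

# Toy usage with made-up sequences and labels.
clf = MarkovClassifier().fit(["MKTAYIAK", "MSSHHHHH"], ["cytoplasm", "nucleus"])
print(clf.predict("MKTAYLAK"))
```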

    A Graphical Model for Shallow Parsing Sequences

    In this paper we explore the space between these two extremes with a model that does not attempt a global characterisation of the sequence, as in the case of PCFGs/HMMs, yet does not assume independence among the generative processes of the subsequent elements in the sequence.

    Structural induction: towards automatic ontology elicitation

    Induction is the process by which we obtain laws (and, more encompassing, theories) about the world. This process can be thought of as aiming to derive two aspects of a theory: a Structural aspect and a Numerical aspect. The Structural aspect is concerned with the entities modeled and their interrelationships, also known as an ontology. The Numerical aspect is concerned with the quantities involved in the relationships among the above-mentioned entities, along with uncertainties either postulated to exist in the world or inherent to the nature of the induction process.

    In this thesis we focus on the structural aspect, hence the name: Structural Induction: Towards Automatic Ontology Elicitation. In order to deal with the problem of Structural Induction we need to solve two main problems: (1) we have to say what we mean by Structure (What?); and (2) we have to say how to get it (How?). In this thesis we give one very definite answer to the first question (What?) and explore how to answer the second question (How?) in some particular cases. A comprehensive answer to the second question in the most general setup would involve dealing very carefully with the interplay between the Structural and Numerical aspects of Induction and would represent a full solution to the Induction problem. This is a vast enterprise, and we are only able to touch on some aspects of the issue.

    The main thesis presented in this work is that the fundamental structural elements from which theories are constructed are Abstraction (grouping similar entities under one overarching category) and Super-Structuring (grouping topologically close entities, in particular spatio-temporally close ones, into a bigger unit). This thesis is supported by showing that each member of the Turing-equivalent class of General Generative Grammars can be decomposed in terms of these operators and their duals (Reverse Abstraction and Reverse Super-Structuring, respectively). Thus, if we accept the Computationalistic Assumption (that the most general way to present a finite theory is by means of an entity expressed in a Turing-equivalent formalism), we have proved that our thesis is correct. We call this the Abstraction + Super-Structuring thesis. The rest of the thesis is concerned with issues opened by the second question presented above (How?): given that we have established what we mean by Structure, how do we get it?
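    A hedged toy illustration of the two operators in grammar-rule form; the function names and the example grammar are inventions of this sketch, not the thesis'. Read as production rules, abstraction corresponds to an OR-rule (one category covering several alternatives) and super-structuring to a composition rule (one unit covering a sequence of adjacent parts).

```python
def abstraction(category, alternatives):
    """Abstraction: category -> alt1 | alt2 | ... (grouping similar entities)."""
    return [(category, (alt,)) for alt in alternatives]

def super_structuring(unit, parts):
    """Super-structuring: unit -> part1 part2 ... (grouping adjacent entities)."""
    return [(unit, tuple(parts))]

# Toy grammar fragment: "Pet" abstracts over cat/dog; "NP" super-structures
# a determiner followed by a Pet.
rules = abstraction("Pet", ["cat", "dog"]) + super_structuring("NP", ["the", "Pet"])
for lhs, rhs in rules:
    print(lhs, "->", " ".join(rhs))
```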

    Fourier Neural Networks

    A new kind of neuron model that has a Fourier-like IN/OUT function is introduced. The model is discussed in a general theoretical framework and some completeness theorems are presented. Current experimental results show that the new model outperforms, by a large margin in both representational power and convergence speed, the classical mathematical model of the neuron based on a weighted sum of inputs filtered by a nonlinear function. The new model is also appealing from a neurophysiological point of view because it produces a more realistic representation by considering the inputs as oscillations.
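    A minimal sketch, assuming the natural reading of the abstract rather than the paper's exact formulation: instead of a nonlinearity applied to a weighted sum, the unit's output is a sum of cosine terms with learnable frequencies, phases, and amplitudes, giving it a Fourier-like IN/OUT function.

```python
import numpy as np

def fourier_neuron(x, freqs, phases, amps):
    """y = sum_k amps[k] * cos(freqs[k] . x + phases[k])."""
    return np.sum(amps * np.cos(freqs @ x + phases))

# Toy instance: 4 cosine components over a 3-dimensional input
# (all parameters random here; in training they would be learned).
rng = np.random.default_rng(0)
freqs = rng.normal(size=(4, 3))            # per-component frequency vectors
phases = rng.uniform(0, 2 * np.pi, size=4) # per-component phase shifts
amps = rng.normal(size=4)                  # per-component amplitudes
print(fourier_neuron(np.array([0.5, -1.0, 2.0]), freqs, phases, amps))
```

    By contrast, the classical neuron the abstract refers to computes nonlinearity(w . x + b) for a single weight vector w; the Fourier-like unit replaces that single filtered sum with a superposition of oscillatory components.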